Introduction

The nice thing about reproducible data analysis (as I try to practice it here on my blog) is, well, that you can quickly reproduce or even replicate an analysis.

So, in this blog post/notebook, I transfer the analysis of "Developers' Habits (IntelliJ Edition)" to another project: the famous open-source operating system Linux. Again, we want to take a look at how much information you can extract from a simple Git log output. This time we want to know

  • where the developers come from
  • on which weekdays the developers work
  • what the normal working hours are and
  • if there is any sign of overtime periods.

Because we use an open approach for our analysis, we are able to respond to newly gained insights. Again, we use Pandas as the data analysis toolkit to accomplish these tasks and execute our code in a Jupyter notebook (find the original on GitHub). We also see some refactorings that leverage Pandas' date functionality a little bit more.

So let's start!

Gaining the data

I've already described the details on how to get the necessary data in my previous blog post. What we have at hand is a nice file with the following contents:

1514531161 -0800    Linus Torvalds  torvalds@linux-foundation.org
1514489303 -0500    David S. Miller davem@davemloft.net
1514487644 -0800    Tom Herbert tom@quantonium.net
1514487643 -0800    Tom Herbert tom@quantonium.net
1514482693 -0500    Willem de Bruijn    willemb@google.com
...

It includes the UNIX timestamp (in seconds since epoch), a space, the time zone offset of the author's clock relative to UTC (which hints at where the author lives), a tab separator, the name of the author, a tab and the email address of the author. The whole log covers 13 years of Linux development as available in the GitHub mirror repository.
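To make the format concrete, here is a minimal sketch that takes one line of the log apart with plain Python. The helper name parse_log_line is made up for this illustration, and the sketch assumes the line is laid out exactly as described above (tabs between the three fields, a single space between timestamp and offset).

```python
from datetime import datetime, timedelta, timezone


def parse_log_line(line):
    """Split one log line into UTC time, UTC offset, author and email.

    Hypothetical helper for illustration only; assumes the layout
    '<epoch seconds> <+-HHMM> TAB <author> TAB <email>'.
    """
    time_part, author, email = line.rstrip("\n").split("\t")
    epoch_seconds, offset_text = time_part.split(" ")

    # "+0530" means 5 hours and 30 minutes, so hours and minutes
    # are converted separately.
    sign = -1 if offset_text.startswith("-") else 1
    offset = sign * timedelta(
        hours=int(offset_text[1:3]), minutes=int(offset_text[3:5]))

    utc_time = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return utc_time, offset, author, email


utc_time, offset, author, email = parse_log_line(
    "1514531161 -0800\tLinus Torvalds\ttorvalds@linux-foundation.org")

# Local wall-clock time of the author = UTC time shifted by the offset.
local_time = utc_time.replace(tzinfo=None) + offset
print(author, local_time)
```

The local time is what matters for the later questions about weekdays and working hours, which is why the offset is kept next to the timestamp.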

Wrangling the raw data

We import the data by using Pandas' read_csv function and the appropriate parameters. We copy only the needed data from the raw dataset into the new DataFrame git_authors.


In [1]:
import pandas as pd

raw = pd.read_csv(
    r'../../linux/git_timestamp_author_email.log',
    sep="\t",
    encoding="latin-1",
    header=None,
    names=['unix_timestamp', 'author', 'email'])

# create separate columns for time data
raw[['timestamp', 'timezone']] = raw['unix_timestamp'].str.split(" ", expand=True)
# convert timestamp data
raw['timestamp'] = pd.to_datetime(raw['timestamp'], unit="s")
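# add the offset in hours ("+0530" means 5.5 hours, not 5.3)
offset = pd.to_numeric(raw['timezone'])
hours = offset.abs() // 100 + (offset.abs() % 100) / 60.0
raw['timezone_offset'] = hours.where(offset >= 0, -hours)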
# calculate the local time
raw["timestamp_local"] = raw['timestamp'] + pd.to_timedelta(raw['timezone_offset'], unit='h')

# filter out wrong timestamps
raw = raw[
    (raw['timestamp'] >= raw.iloc[-1]['timestamp']) &
    (raw['timestamp'] <= pd.to_datetime('today'))]

git_authors = raw[['timestamp_local', 'timezone', 'author']].copy()
git_authors.head()


Out[1]:
timestamp_local timezone author
0 2017-12-31 14:47:43 -0800 Linus Torvalds
1 2017-12-31 13:13:56 -0800 Linus Torvalds
2 2017-12-31 13:03:05 -0800 Linus Torvalds
3 2017-12-31 12:30:34 -0800 Linus Torvalds
4 2017-12-31 12:29:02 -0800 Linus Torvalds

Refining the dataset

In this section, we add some additional time-based information to the DataFrame to accomplish our tasks.

Adding weekdays

First, we add the weekday information based on the day names of the timestamp_local column. Because we want to preserve the order of the weekdays, we also convert the weekday entries to a Categorical data type. The order of the weekdays is taken from the calendar module.

Note: We can do this so easily because we have such a large amount of data where every weekday occurs. If we can't be sure to have a continuous sequence of weekdays, we have to use something like the pd.Grouper method to fill in missing weekdays.


In [2]:
import calendar

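# dt.day_name() replaces the dt.weekday_name attribute, which newer Pandas versions no longer provide
git_authors['weekday'] = git_authors["timestamp_local"].dt.day_name()
git_authors['weekday'] = pd.Categorical(
    git_authors['weekday'],
    categories=list(calendar.day_name),
    ordered=True)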
git_authors.head()


Out[2]:
timestamp_local timezone author weekday
0 2017-12-31 14:47:43 -0800 Linus Torvalds Sunday
1 2017-12-31 13:13:56 -0800 Linus Torvalds Sunday
2 2017-12-31 13:03:05 -0800 Linus Torvalds Sunday
3 2017-12-31 12:30:34 -0800 Linus Torvalds Sunday
4 2017-12-31 12:29:02 -0800 Linus Torvalds Sunday
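To see what the ordered Categorical buys us, here is a small sketch on made-up toy data (three commits, no connection to the Linux log). It assumes a Pandas version that provides dt.day_name() and an English locale for the calendar module. In this sketch, weekdays without any commit still show up with a count of zero, because they are declared as categories.

```python
import calendar

import pandas as pd

# Toy data: three commits, none of them on a Tuesday or on a weekend.
commits = pd.DataFrame({
    "timestamp_local": pd.to_datetime([
        "2018-01-01 09:15",  # Monday
        "2018-01-03 14:30",  # Wednesday
        "2018-01-05 18:45",  # Friday
    ])
})

weekday_names = list(calendar.day_name)  # Monday ... Sunday
commits["weekday"] = pd.Categorical(
    commits["timestamp_local"].dt.day_name(),
    categories=weekday_names,
    ordered=True)

# sort=False keeps the category order instead of sorting by count.
counts = commits["weekday"].value_counts(sort=False)
print(counts)
```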

Adding working hours

For the working hour analysis, we extract the hour information from the timestamp_local column.

Note: Again, we assume that every hour is in the dataset.


In [3]:
git_authors['hour'] = git_authors['timestamp_local'].dt.hour
git_authors.head()


Out[3]:
timestamp_local timezone author weekday hour
0 2017-12-31 14:47:43 -0800 Linus Torvalds Sunday 14
1 2017-12-31 13:13:56 -0800 Linus Torvalds Sunday 13
2 2017-12-31 13:03:05 -0800 Linus Torvalds Sunday 13
3 2017-12-31 12:30:34 -0800 Linus Torvalds Sunday 12
4 2017-12-31 12:29:02 -0800 Linus Torvalds Sunday 12
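If the assumption that every hour occurs does not hold for a smaller repository, the gaps can be filled explicitly. The following sketch on made-up toy data shows one possible way, using reindex to add the missing hours with a count of zero. It is an illustration under these assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy data: commits at only three distinct hours of the day.
commits = pd.DataFrame({
    "timestamp_local": pd.to_datetime([
        "2018-01-01 09:15",
        "2018-01-01 09:50",
        "2018-01-02 14:30",
        "2018-01-03 23:05",
    ])
})
commits["hour"] = commits["timestamp_local"].dt.hour

# Count commits per hour, then add the hours that never occur as zeros.
commits_per_hour = (
    commits.groupby("hour").size()
    .reindex(range(24), fill_value=0))
print(commits_per_hour[commits_per_hour > 0])
```

With all 24 hours present, a bar chart of commits_per_hour has an evenly spaced x-axis even for sparse data.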

Analyzing the data

With the prepared git_authors DataFrame, we are now able to deliver insights into the past years of development.

Developers' timezones

First, we want to know roughly where the developers live. For this, we plot the values of the timezone column as a pie chart.


In [4]:
%matplotlib inline
timezones = git_authors['timezone'].value_counts()
timezones.plot(
    kind='pie',
    figsize=(7,7),
    title="Developers' timezones",
    label="")


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e0eefbc5c0>

Result

The majority of the developers' commits come from the time zones +0100, +0200 and -0700. With most commits probably coming from the West Coast of the USA, this might just be an indicator that Linus Torvalds lives there ;-) . But there are also many commits from developers in Western Europe.

Weekdays with the most commits

Next, we want to know on which days the developers are working during the week. We count by the weekdays but avoid sorting the results to keep the order along with our categories. We plot the result as a standard bar chart.


In [5]:
ax = git_authors['weekday'].\
        value_counts(sort=False).\
        plot(
            kind='bar',
            title="Commits per weekday")
ax.set_xlabel('weekday')
ax.set_ylabel('# commits')


Out[5]:
<matplotlib.text.Text at 0x1e0ef2960b8>

Result

Most of the commits occur during normal working days with a slight peak on Wednesday. There are relatively few commits happening on weekends.

Working behavior of the main contributor

It would be very interesting and easy to see when Linus Torvalds (the main contributor to Linux) is working. But we won't do that, because the as-yet unwritten code of conduct of Software Analytics tells us that it's not OK to analyze a single person's behavior, especially when such an analysis is based on an uncleaned dataset like the one we have here.

Usual working hours

To find out about the working habits of the contributors, we group the commits by hour and count the entries (in this case we choose author) to see if there are any irregularities. Again, we plot the results with a standard bar chart.


In [6]:
ax = git_authors\
    .groupby(['hour'])['author']\
    .count().plot(kind='bar')
ax.set_title("Distribution of working hours")
ax.yaxis.set_label_text("# commits")
ax.xaxis.set_label_text("hour")


Out[6]:
<matplotlib.text.Text at 0x1e0ef2497f0>

Result

The distribution of the working hours is interesting:

  • First, we can clearly see that there is a dent around 12:00. So this might be an indicator that developers have lunch at regular times (which is a good thing IMHO).
  • Another not so typical result is the slight rise after 20:00. This could be interpreted as the development activity of free-time developers that code for Linux after their day-time job.
  • Nevertheless, most of the developers seem to get a decent amount of sleep indicated by low commit activity from 1:00 to 7:00.

Signs of overtime

Finally, we have a look at possible overtime periods by creating a simple model. We first group all commits on a weekly basis per author. As the grouping function, we choose max() to get the latest hour at which each author committed in a given week.


In [7]:
latest_hour_per_week  = git_authors.groupby(
    [
      pd.Grouper(key='timestamp_local', freq='W'),
      'author'
    ]
    )[['hour']].max()

latest_hour_per_week.head()


Out[7]:
hour
timestamp_local author
2005-04-17 Adrian Bunk 15
Alexander Nyberg 15
Andi Kleen 15
Andrea Arcangeli 15
Andrew Vasquez 15

Next, we want to know if there were any stressful time periods that forced the developers to work overtime over a longer period of time. We calculate the mean of all late stays of all authors for each week.


In [8]:
# select the numeric hour column explicitly so the author names are not averaged
mean_latest_hours_per_week = \
    latest_hour_per_week \
    .reset_index().groupby('timestamp_local')[['hour']].mean()
mean_latest_hours_per_week.head()


Out[8]:
hour
timestamp_local
2005-04-17 14.956522
2005-04-24 16.494382
2005-05-01 12.349398
2005-05-08 15.979798
2005-05-15 15.984127
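The two grouping steps can be traced on a toy example. The sketch below uses made-up commits of two hypothetical authors ("alice" and "bob") and assumes that the weekly frequency 'W' closes each bin on a Sunday, which is the Pandas default as far as I know.

```python
import pandas as pd

# Toy data: two authors, commits spread over two calendar weeks.
commits = pd.DataFrame({
    "timestamp_local": pd.to_datetime([
        "2018-01-01 09:00",  # Monday, week ending Sunday 2018-01-07
        "2018-01-03 22:00",
        "2018-01-02 17:00",
        "2018-01-09 11:00",  # Tuesday, week ending Sunday 2018-01-14
    ]),
    "author": ["alice", "alice", "bob", "alice"],
})
commits["hour"] = commits["timestamp_local"].dt.hour

# Step 1: latest commit hour per week and author.
latest = commits.groupby(
    [pd.Grouper(key="timestamp_local", freq="W"), "author"]
)[["hour"]].max()

# Step 2: average these latest hours over all authors of a week.
mean_latest = (
    latest.reset_index()
    .groupby("timestamp_local")[["hour"]].mean())
print(mean_latest)
```

In the first week, alice's latest commit is at 22:00 and bob's at 17:00, so the weekly mean is 19.5. A week in which many authors commit late pushes this mean up, which is the signal this simple model looks for.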

We also create a trend line that shows how the contributors' latest working hours develop over the span of the past years. We use the polyfit function from numpy for this, which needs a numeric index to calculate the polynomial coefficients. We fit a third-degree polynomial to the hours of the mean_latest_hours_per_week DataFrame, wrap the coefficients in a poly1d object, and evaluate it to get the y-coordinates for all weeks that are encoded in numeric_index. We store the result in the trend column of the mean_latest_hours_per_week DataFrame.


In [9]:
import numpy as np

numeric_index = range(0, len(mean_latest_hours_per_week))
coefficients = np.polyfit(numeric_index, mean_latest_hours_per_week.hour, 3)
polynomial = np.poly1d(coefficients)
ys = polynomial(numeric_index)
mean_latest_hours_per_week['trend'] = ys
mean_latest_hours_per_week.head()


Out[9]:
hour trend
timestamp_local
2005-04-17 14.956522 14.742989
2005-04-24 16.494382 14.743842
2005-05-01 12.349398 14.744694
2005-05-08 15.979798 14.745545
2005-05-15 15.984127 14.746396
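How polyfit and poly1d work together is easier to see on synthetic data where the answer is known. The following sketch generates values from a cubic with made-up coefficients and checks that the fit recovers them. The numbers are purely illustrative.

```python
import numpy as np

# Synthetic values that follow a known cubic curve exactly.
x = np.arange(10)
y = 0.5 * x**3 - 2.0 * x**2 + 1.0 * x + 3.0

# Fit a third-degree polynomial; coefficients come highest power first.
coefficients = np.polyfit(x, y, 3)
polynomial = np.poly1d(coefficients)

# Evaluating the polynomial at every x yields the trend line values.
trend = polynomial(x)
print(np.round(coefficients, 3))
```

On real, noisy data like the weekly means, the fitted curve no longer passes through the points but smooths them, which is what makes it usable as a trend line.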

Finally, we plot the hour results of the mean_latest_hours_per_week DataFrame as well as the trend data in one line plot.


In [10]:
ax = mean_latest_hours_per_week[['hour', 'trend']].plot(
    figsize=(10, 6), 
    color=['grey','blue'], 
    title="Latest hours per week")
ax.set_xlabel("time")
ax.set_ylabel("hour")


Out[10]:
<matplotlib.text.Text at 0x1e0ef28f4a8>

Result

We see no sign of significant overtime periods over 13 years of Linux development. Shortly after the creation of the Git mirror repository, there might have been a time with some irregularities. But overall, there are no signs of death marches. It seems that the Linux development team has established a stable development process.

Closing remarks

Again, we've seen that various metrics and results can be easily created from a simple Git log output file. With Pandas, it's possible to get to know the habits of the developers of software projects. Thanks to Jupyter's open notebook approach, we can easily adapt existing analyses and add situation-specific information to them as we go along.